This project is inspired by a data set I found on Kaggle containing the names of shipwrecks, the years they occurred, their coordinates, and some notes about their sinking.
library(tidyverse)
library(plotly)
Shipwrecks <- read_csv("~/GitHub/Analyzing-Shipwrecks/Shipwrecks.csv", na = '')
glimpse(Shipwrecks)
## Rows: 3,959
## Columns: 5
## $ Ship <chr> "Black Assarca shipwreck", "Globe Star", "Gulland", "Mtong…
## $ Flag <chr> "Unknown", "Singapore", "Aden", "Tanzania", "Greece", "Roy…
## $ `Sunk date` <chr> "Early 7th century (Presumed)", "27-Apr-73", "13-Apr-51", …
## $ Notes <chr> "A wreck discovered at Black Assarca Island in 1995. It wa…
## $ Coordinates <chr> NA, "4°04′54″S 39°43′12″E", "04°02′50″S 39°43′57″E", NA, "…
The data set did not appear to contain a unique identifier. Before cleaning the data, I wanted to check if any ship names were duplicated.
Shipwrecks$Ship[duplicated(Shipwrecks$Ship)]
## [1] NA "Phoenix" NA
## [4] "HMS Defence" "U-20" "Unknown"
## [7] "K-27" "Duchess of York" "HMS Repulse"
## [10] "U-103" "HMS Hermes" "Tullaghmurray Lass"
## [13] "HMS Gloucester" "HMS Invincible" "Amsterdam"
## [16] "HMS Invincible" "HMS Niger" "Preußen"
## [19] "RMS Alaunia" "HMS Ariadne" "HMS Holland 5"
## [22] "RMS Moldavia" "Nyon" "Sitakund"
## [25] "Storaa" "U-40" "U-413"
## [28] "UC-65" "Wittering" "HMS Amphion"
## [31] "HMS Foyle" "HMS Invincible" "HMS Warrior"
## [34] "London" "Pacific" "San Pedro"
## [37] "HMS Drake" "Florence" "Kriegsmarine"
## [40] "Kriegsmarine" "Kriegsmarine" NA
## [43] NA "Auguste" "Enterprise"
## [46] "HMS Terror" "Eagle" "HMS Phoenix"
## [49] "HMCS Charlottetown" "Carrier Dove" "San Pedro"
## [52] "Montana" "USS Jacob Jones" "Vizcaya"
## [55] "Rusland" "HMS Cerberus" NA
## [58] NA NA "Phoenix"
## [61] "Hanover" "Phoenix" "Georgia"
## [64] NA NA NA
## [67] "Eagle" "San Pedro" NA
## [70] "SS Alpena" "Daniel Lyons" "Fleetwing"
## [73] "Frank O'Connor" "Grape Shot" "Hanover"
## [76] "Joys" "Louisiana" "Niagara"
## [79] "Phoenix" "Success" "Tennie and Laura"
## [82] "Three Brothers" NA "America"
## [85] "PS Anthony Wayne" "Argo" "Armenia"
## [88] "Charles Foster" "Crete" "Dean Richmond"
## [91] "Eldorado" "Indiana" "Merida"
## [94] "Morania" "Niagara" "North Carolina"
## [97] "Oneida" "Oxford" "Philip D. Armour"
## [100] "S.K. Martin" "Siberia" "Success"
## [103] "Unknown" "Hilda" "Ocean Wave"
## [106] "Unknown" "Washington" NA
## [109] NA NA NA
## [112] "Columbia" "King Philip" "Adventure"
## [115] "USS Atlanta" "Idaho" "Portland"
## [118] "SS Tiger" "Phoenix" "Enterprise"
## [121] "John Hunter" "Alpha" "Friendship"
## [124] "Occident" "Victoria" "General Butler"
## [127] "Eureka" "Sacramento" "Columbia"
## [130] "Phyllis" "New Carissa" NA
## [133] "California" "SMS Dresden" "Imo"
## [136] "Dolphin" "Novelty" "USS Atlanta"
## [139] "USS Gregory" "Adonis" "Adventure"
## [142] "Dundee" "George" "Nancy"
## [145] "Oakland" "Queen Bee" "HMAS Voyager"
## [148] "Frederick" "HMS Pandora" "Grecian"
## [151] "Maria" "Water Witch" "Venus"
## [154] "Alert" "Pioneer" "Contest"
## [157] "Elizabeth" "Ivy" "Neptune"
## [160] "HMAS Perth" "Rockingham" "HMAS Swan"
## [163] "HMAS Torrens" "Venus" "Windsor"
## [166] "SS Cap Arcona" "MS Estonia" "Graf Zeppelin"
## [169] "France" "SS Admiral Nakhimov" "Armenia"
## [172] "Novorossiysk" "Almirante Oquendo" "American"
## [175] "Castilla" "Ceres" "Cristóbal Colón"
## [178] "Cristóbal Colón" "Hector" "USS Memphis"
## [181] "Vizcaya" "Albatross" "Empire Mica"
## [184] "Gulfpenn" "Heredia" "Oaxaca"
## [187] "U-2513" "HMS A3" "HMS A3"
## [190] "CSS Alabama" "HMS Ariadne" "HMS Charybdis"
## [193] "HMS D3" "HMS Durban" "HMS Empress of India"
## [196] "HMS Formidable" "SMS Grosser Kurfürst" "HMS Hood"
## [199] "HMS Tiger" "San Juan de Pasajes" "Antikythera wreck"
## [202] "Ariane" "Bouvet" "HMHS Britannic"
## [205] "HMS Graph" "HMS Drake" "HMS Amphion"
## [208] "HMS Ardent" "HMS Arethusa" "HMS Ariel"
## [211] "HMS Black Prince" "HMS Brilliant" "HMS Bulwark"
## [214] "SMS Cöln" "HMS Defence" "HMS Defence"
## [217] "SMS Elbing" "HMS Falmouth" "SMS Frauenlob"
## [220] "HMS Hermes" "HMS Indefatigable" "HMS Intrepid"
## [223] "HMS Invincible" "SMS Lützow" "Adolf Vinnen"
## [226] "HMS Ardent" "SS Prins Olav" "HMS Antelope"
## [229] "HMS Ardent" "HMS Birkenhead" "HMS Coventry"
## [232] "HMS Hermes" "HMAS Vampire" "I-60"
## [235] "Kormoran" "USS Langley" "HMS Pegasus"
## [238] "HMS Sirius" "HMAS Sydney" "Haguro"
## [241] "Kuma"
Wow! 241 ship names were repeated. It is likely that ship names have been reused throughout history. Let’s see if any records are complete duplicates across all of our fields.
duplicates <- duplicated(Shipwrecks)
Shipwrecks[duplicates, ]
## # A tibble: 4 × 5
## Ship Flag `Sunk date` Notes Coordinates
## <chr> <chr> <chr> <chr> <chr>
## 1 Nyon Switzerland 15-Jun-62 A Swiss cargo ship that ran ag… <NA>
## 2 Storaa United Kingdom 3-Nov-43 A British coaster sunk by a Ge… <NA>
## 3 U-40 Kriegsmarine 13-Oct-39 A German submarine sunk by a m… 50°42′N 0°…
## 4 Unknown <NA> <NA> <NA> <NA>
It appears four records are exact matches. It will be best to remove them.
Shipwrecks <- unique(Shipwrecks)
We should add a unique identifier to prevent mixing up shipwrecks with the same name.
Shipwrecks <- rowid_to_column(Shipwrecks, "Ship_Id")
We want to know the year that each shipwreck occurred so that we can determine which years had more shipwrecks than others. Looking at the ‘Sunk date’ column though, it appears the dates are in all sorts of different formats. We can start by extracting years from dates in date, month, year and month, year formats. Some years from the 1900s don’t specify the century though, and our code incorrectly assumes it to be the 21st century in some records, so we need to correct that after extracting the year.
Shipwrecks <- Shipwrecks %>%
mutate(Year_Sunk = ifelse(!is.na(lubridate::dmy(Shipwrecks$`Sunk date`)), as.character(year(lubridate::dmy(Shipwrecks$`Sunk date`))),
ifelse(!is.na(lubridate::my(Shipwrecks$`Sunk date`)), as.character(year(lubridate::my(Shipwrecks$`Sunk date`))), NA)))
Shipwrecks$Year_Sunk <- ifelse(as.numeric(Shipwrecks$Year_Sunk) >= 2000, as.character(as.numeric(Shipwrecks$Year_Sunk) - 100), Shipwrecks$Year_Sunk)
We can now search the ‘Sunk date’ column for any additional records containing a 4-digit number, which we can assume to be the year the ship sunk.
Shipwrecks <- Shipwrecks %>%
mutate(Year_Sunk = ifelse(is.na(Shipwrecks$Year_Sunk) & str_detect(Shipwrecks$`Sunk date`, "\\d{4}"), str_extract(Shipwrecks$`Sunk date`, "\\d{4}"), Year_Sunk))
Note that not every record has a date for the shipwreck. We could continue applying filters to extract more years from the records that do have dates; however, each iteration is becoming more and more specific and time-intensive. For the sake of moving on with this project, we will just allow the uncategorized dates to be grouped with the records without dates.
Before we move on, we need to convert our new Year_Sunk column to numeric.
Shipwrecks$Year_Sunk <- as.numeric(Shipwrecks$Year_Sunk)
Reviewing the minimum and maximum years, it seems we have a shipwreck from the year 2200 - that’s obviously not possible. Looking at the record, it was originally entered as 2200 BC. We’ll remove it since we aren’t interested in shipwrecks that old anyways.
Shipwrecks <- Shipwrecks %>%
mutate(Year_Sunk = ifelse(Shipwrecks$Year_Sunk == 2200, NA, Shipwrecks$Year_Sunk))
Curious which years had the most shipwrecks? Let’s pull the top 10 years.
Years_With_Most_Shipwrecks <- Shipwrecks %>%
filter(!is.na(Year_Sunk)) %>%
group_by(Year_Sunk) %>%
summarise(Total_Wrecks = n()) %>%
arrange(desc(Total_Wrecks)) %>%
head(10)
Here are our results. Looks like most shipwrecks occurred during World War I and World War II.
## # A tibble: 10 × 2
## Year_Sunk Total_Wrecks
## <dbl> <int>
## 1 1942 336
## 2 1944 228
## 3 1943 125
## 4 1945 121
## 5 1940 111
## 6 1917 103
## 7 1918 89
## 8 1941 76
## 9 1916 70
## 10 1914 60
Let’s make a plot showing the number of shipwrecks each year between 1850 and 2000. We can highlight the major wars during that time period, to include the American Civil War, World War I, and World War II.
Shipwrecks_Per_Year <- ggplot(Shipwrecks, aes(x = Year_Sunk)) +
geom_bar(fill = "steelblue") +
annotate("rect", xmin = 1861 - 0.5, xmax = 1865 + 0.5, ymin = 0, ymax = 50, alpha = 0.3, fill = "red") +
annotate("rect", xmin = 1914 - 0.5, xmax = 1918 + 0.5, ymin = 0, ymax = 110, alpha = 0.3, fill = "red") +
annotate("rect", xmin = 1939 - 0.5, xmax = 1945 + 0.5, ymin = 0, ymax = 350, alpha = 0.3, fill = "red") +
xlim(1850, 2000) +
labs(title = "Number of Shipwrecks Per Year (1850-2000)", x = "Year", y = "Number of Shipwrecks") +
theme(plot.title = element_text(face = "bold", size = 14)) +
annotate(geom = "text", x = 1863, y = 60, label = "Civil War (1861-1865)") +
annotate(geom = "text", x = 1916, y = 120, label = "World War I (1914-1918)") +
annotate(geom = "text", x = 1942, y = 360, label = "World War II (1939-1945)")
Shipwrecks_Per_Year
There are definitely some peaks during those wars!
This data set also includes the latitudes and longitudes of many of the shipwrecks. Converting these from character to decimal coordinates will be tedious though…
The first step is to separate the Coordinates column into separate Latitude and Longitude columns.
Shipwrecks <- Shipwrecks %>%
separate_wider_delim(Coordinates, " ", names = c("Latitude", "Longitude"), too_few = "align_start", too_many = "drop")
Now we can apply some elbow grease. Almost all of the coordinates are in date-month-second format, but we need them to be in decimal format in order to map them.
Shipwreck_Coords <- Shipwrecks %>%
filter(!is.na(Year_Sunk) & !is.na(Latitude) & !is.na(Longitude)) %>%
mutate(Lat_Date = ifelse(grepl("N", Latitude, fixed = TRUE), as.numeric(word(Latitude, 1, sep = "°")),
ifelse(grepl("S", Latitude, fixed = TRUE), -1 * as.numeric(word(Latitude, 1, sep = "°")), NA))) %>%
mutate(Long_Date = ifelse(grepl("E", Longitude, fixed = TRUE), as.numeric(word(Longitude, 1, sep = "°")),
ifelse(grepl("W", Longitude, fixed = TRUE), -1 * as.numeric(word(Longitude, 1, sep = "°")), NA))) %>%
mutate(Lat_Minutes = ifelse(grepl("N", Latitude, fixed = TRUE), sub(".*°", "", Latitude),
ifelse(grepl("S", Latitude, fixed = TRUE), sub(".*°", "", Latitude), NA))) %>%
mutate(Lat_Minutes = ifelse(grepl("′", Lat_Minutes, fixed = TRUE), as.numeric(word(Lat_Minutes, 1, sep = "′")), 0)) %>%
mutate(Long_Minutes = ifelse(grepl("E", Longitude, fixed = TRUE), sub(".*°", "", Longitude),
ifelse(grepl("W", Longitude, fixed = TRUE), sub(".*°", "", Longitude), NA))) %>%
mutate(Long_Minutes = ifelse(grepl("′", Long_Minutes, fixed = TRUE), as.numeric(word(Long_Minutes, 1, sep = "′")), 0)) %>%
mutate(Lat_Seconds = ifelse(grepl("N", Latitude, fixed = TRUE), sub(".*′", "", Latitude),
ifelse(grepl("S", Latitude, fixed = TRUE), sub(".*′", "", Latitude), NA))) %>%
mutate(Lat_Seconds = ifelse(grepl("″", Lat_Seconds, fixed = TRUE), as.numeric(word(Lat_Seconds, 1, sep = "″")), 0)) %>%
mutate(Long_Seconds = ifelse(grepl("E", Longitude, fixed = TRUE), sub(".*′", "", Longitude),
ifelse(grepl("W", Longitude, fixed = TRUE), sub(".*′", "", Longitude), NA))) %>%
mutate(Long_Seconds = ifelse(grepl("″", Long_Seconds, fixed = TRUE), as.numeric(word(Long_Seconds, 1, sep = "″")), 0)) %>%
mutate(Lat = ifelse(grepl("N", Latitude, fixed = TRUE), Lat_Date + Lat_Minutes / 60 + Lat_Seconds / 3600,
ifelse(grepl("S", Latitude, fixed = TRUE), Lat_Date + -1 * Lat_Minutes / 60 + -1 * Lat_Seconds / 3600, NA))) %>%
mutate(Long = ifelse(grepl("E", Longitude, fixed = TRUE), Long_Date + Long_Minutes / 60 + Long_Seconds / 3600,
ifelse(grepl("W", Longitude, fixed = TRUE), Long_Date + -1 * Long_Minutes / 60 + -1 * Long_Seconds / 3600, NA))) %>%
select(Ship_Id, Ship, Year_Sunk, Lat, Long)
Now we should be ready to map these coordinates! First we should define our geo argument for our future map.
World_Map <- list(
scope = 'world',
showland = TRUE,
landcolor = toRGB("LightBlue"),
showcountries = TRUE,
showocean = TRUE,
oceancolor = toRGB("gray95"),
showlakes = TRUE,
lakecolor = toRGB("gray95"),
showrivers = TRUE,
rivercolor = toRGB("gray95"))
Next, since we are only interested in shipwrecks from 1850 to 2000, we will create a separate table for just that date range.
Shipwreck_Coords_1850_2000 <- Shipwreck_Coords %>%
filter(Year_Sunk >= 1850 & Year_Sunk <= 2000)
Finally, we can project the locations of shipwrecks onto a map of the world.
Shipwreck_Map <- plot_geo(Shipwreck_Coords_1850_2000, lat = ~Lat, lon = ~Long) %>%
add_markers(text = ~paste(Ship, paste("Year Sunk: ", Year_Sunk), sep = "<br />"),
color = ~Year_Sunk, colors = "Reds", symbol = I("square"), size = I(8), hoverinfo = "text") %>%
colorbar(title = "Year Sunk", x = 1, y = 0.8) %>%
layout(title = list(text = "Shipwrecks Around the World", y = 0.9),
margin = list(b = 0, t = 0, l = 0, r = 0),
geo = World_Map)
Shipwreck_Map
Hopefully you enjoyed studying these shipwrecks from throughout history. It appears our mapping was a bit off in some places (last I checked, ships are not supposed to go on land). Perhaps this is due to the data set being inaccurate, or perhaps there is some disparity between the decimal coordinates we calculated and the way the world map is projected. Regardless, we were able to get a good sense of both when and where these shipwrecks occurred. Thanks for reading!